Abstract

The B-tree is a fundamental secondary index structure that is widely used for answering one-dimensional range
reporting queries. Given a set of $N$ keys, a range query can be answered in $O(\log_B \frac{N}{M} + \frac{K}{B})$ I/Os,
where $B$ is the disk block size, $K$ the output size, and $M$ the size of the main memory buffer. When keys are
inserted or deleted, the B-tree is updated in $O(\log_B N)$ I/Os, if we require the resulting changes to be committed
to disk right away. Otherwise, the memory buffer can be used to buffer the recent updates, and changes can be written
to disk in batches, which significantly lowers the amortized update cost. A systematic way of batching up updates is
to use the logarithmic method, combined with fractional cascading, resulting in a dynamic B-tree that supports
insertions in $O(\frac{1}{B} \log \frac{N}{M})$ I/Os and queries in $O(\log \frac{N}{M} + \frac{K}{B})$ I/Os. Such bounds
have also been matched by several known dynamic B-tree variants in the database literature. Note, however, that the
query cost of these dynamic B-trees is substantially worse than the $O(\log_B \frac{N}{M} + \frac{K}{B})$ bound of the
static B-tree by a factor of $\Theta(\log B)$.

In this paper, we prove that for any dynamic one-dimensional range query index structure with query cost
$O(q + \frac{K}{B})$ and amortized insertion cost $O(u/B)$, the tradeoff $q \cdot \log(u/q) = \Omega(\log B)$ must hold
if $q = O(\log B)$. For most reasonable values of the parameters, we have $\frac{N}{M} = B^{O(1)}$, in which case our
query-insertion tradeoff implies that the bounds mentioned above are already optimal. We also prove a lower bound of
$u \cdot \log q = \Omega(\log B)$, which is relevant for larger values of $q$. Our lower bounds hold in a dynamic
version of the indexability model, which is of independent interest. Dynamic indexability is a clean yet powerful
model for studying dynamic indexing problems, and can potentially lead to more interesting complexity results.

1 Introduction 

The B-tree [5] is a fundamental secondary index structure used in nearly all database systems. It has both very good
space utilization and query performance: Assuming each disk block can store $B$ data records, the B-tree occupies
$O(\frac{N}{B})$ disk blocks for $N$ data records, and supports one-dimensional range reporting queries in
$O(\log_B N + \frac{K}{B})$ I/Os (or page accesses) where $K$ is the output size. Due to the large fanout of the
B-tree, for most practical values of $N$ and $B$, the B-tree is very shallow and $\log_B N$ is essentially a constant.
Very often we also have a memory buffer of size $M$, which can be used to store the top $\Theta(\log_B M)$ levels of
the B-tree, further lowering the effective height of the B-tree to $O(\log_B \frac{N}{M})$, meaning that we can usually
get to the desired leaf with merely one or two I/Os, and then start pulling out results.

If one wants to update the B-tree directly on disk, it is also well known that it takes $O(\log_B N)$ I/Os. Things
become much more interesting if we make use of the main memory buffer to collect a number of updates and then
perform the updates in batches, lowering the amortized update cost significantly. For now let us focus on insertions
only; deletions are in general much less frequent than insertions, and there are some generic methods for dealing with
deletions by converting them into insertions of "delete signals" [2, 17]. The idea of using a buffer space to batch
up insertions has been well exploited in the literature, especially for the purpose of managing historical data, where
there are many more insertions than queries. The LSM-tree [17] was the first along this line of research, by applying
the logarithmic method [7] to the B-tree. Fix a parameter $2 \le \ell \le B$. It builds a collection of B-trees of
sizes up to $M, \ell M, \ell^2 M, \ldots$, respectively, where the first one always resides in memory. An insertion
always goes to the memory-resident tree; if the first $i$ trees are full, they are merged together with the $(i+1)$-th
tree by rebuilding. Standard analysis shows that the amortized insertion cost is
$O(\frac{\ell}{B} \log_\ell \frac{N}{M})$. A query takes $O(\log_B N \log_\ell \frac{N}{M} + \frac{K}{B})$ I/Os since
$O(\log_\ell \frac{N}{M})$ trees need to be queried. Using fractional cascading [10], the query cost can be improved to
$O(\log_\ell \frac{N}{M} + \frac{K}{B})$ without affecting the (asymptotic) size of the index and the update cost, but
this result appears to be folklore. Later Jermaine et al. [14] proposed the Y-tree as "yet" another B-tree structure
for the purpose of lowering the insertion cost. The Y-tree is an $\ell$-ary tree, where each internal node is
associated with a bucket storing all the elements to be pushed down to its subtree. The bucket is emptied only when it
has accumulated $\Omega(B)$ elements. Although [14] did not give a rigorous analysis, it is not difficult to derive
that its insertion cost is $O(\frac{\ell}{B} \log_\ell \frac{N}{M})$ and query cost
$O(\log_\ell \frac{N}{M} + \frac{K}{B})$, namely, the same as those of the LSM-tree with fractional cascading. Around
the same time Buchsbaum et al. [9] independently proposed the buffered repository tree in a different context, with
similar ideas and the same bounds as the Y-tree. In order to support even faster insertions, Jagadish et al. [13]
proposed the stepped merge tree, a variant of the LSM-tree. At each level, instead of keeping one tree of size
$\ell^i M$, they keep up to $\ell$ individual trees. When there are $\ell$ level-$i$ trees, they are merged to form a
level-$(i+1)$ tree. The stepped merge tree has an insertion cost of $O(\frac{1}{B} \log_\ell \frac{N}{M})$, lower than
that of the LSM-tree. But the query cost is a lot worse, reaching
$O(\ell \log_B N \log_\ell \frac{N}{M} + \frac{K}{B})$ I/Os since $\ell$ trees need to be queried at each level. Again
the query cost can be improved to $O(\ell \log_\ell \frac{N}{M} + \frac{K}{B})$ using fractional cascading. The
current best known results are summarized in Table 1. Typically $\ell$ is set to be a constant [13, 14, 17], at which
point all the indexes have the same asymptotic performance of $O(\log \frac{N}{M} + \frac{K}{B})$ query and
$O(\frac{1}{B} \log \frac{N}{M})$ insertion. Note that the amortized insertion bound of these dynamic B-trees could be
much smaller than one I/O, hence much faster than updating the B-tree directly on disk. The query cost is, however,
substantially worse than the $O(\log_B \frac{N}{M})$ query cost of the static B-tree by a $\Theta(\log B)$ factor. As
typical values of $B$ range from hundreds to thousands, we are expecting a 10-fold degradation in query performance
for these dynamic B-trees. Thus the obvious question is, can we lower the query cost while still allowing for fast
insertions?
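
The logarithmic method described above can be illustrated with a minimal sketch. It is not the LSM-tree of [17]: sorted Python lists stand in for B-trees, an overfull level is cascaded into the next one level at a time, and key writes are counted instead of I/Os. The names `LogMethodIndex`, `disk_writes`, and `range_query` are hypothetical and introduced only for this illustration.

```python
import bisect
from heapq import merge


class LogMethodIndex:
    """Toy logarithmic-method index: level i holds at most ell**i * M keys.

    Sorted lists stand in for B-trees (an assumption of this sketch).
    """

    def __init__(self, M, ell):
        self.M, self.ell = M, ell
        self.levels = [[]]      # levels[0] models the memory-resident tree
        self.disk_writes = 0    # keys written while rebuilding lower levels

    def _capacity(self, i):
        return self.M * self.ell ** i

    def insert(self, key):
        bisect.insort(self.levels[0], key)
        i = 0
        # Cascade: an overfull level is merged into the next one by rebuilding.
        while len(self.levels[i]) > self._capacity(i):
            if i + 1 == len(self.levels):
                self.levels.append([])
            merged = list(merge(self.levels[i], self.levels[i + 1]))
            self.levels[i], self.levels[i + 1] = [], merged
            self.disk_writes += len(merged)   # a rebuild rewrites every key
            i += 1

    def range_query(self, lo, hi):
        """Report all keys in [lo, hi]; every level has to be searched."""
        out = []
        for level in self.levels:
            a = bisect.bisect_left(level, lo)
            b = bisect.bisect_right(level, hi)
            out.extend(level[a:b])
        return sorted(out)
```

In this sketch a query has to visit every level, which is the source of the extra logarithmic factor in the query bound, while each key is rewritten only when a level above it overflows. Dividing `disk_writes` by $B$ times the number of insertions gives a rough, simulated analogue of the amortized insertion cost in blocks.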

Table 1: Query/insertion upper bounds of previously known B-tree indexes, for a parameter $2 \le \ell \le B$.



In particular, the indexes listed in Table 1 are all quite practical, so one may wonder if there are some fancy
complicated theoretical structures with better bounds that have not been found yet. For the static range query problem,
this turned out to be indeed the case. A somewhat surprising result by Alstrup et al. [1] shows that it is possible to
achieve linear size and $O(K)$ query time in the RAM model. This result also carries over to external memory, yielding
a disk-based index with $O(\frac{N}{B})$ blocks and $O(1 + \frac{K}{B})$-I/O query cost. However, this structure is
overly complicated, and is actually worse than the B-tree in practice. In the dynamic case, a recent result by
Mortensen et al. [16] gives a RAM-structure with $O(\log\log\log N + K)$ query time and $O(\log\log N)$ update time.
This result, when carried over to external memory, gives us an update cost of $O(\log\log N)$ I/Os. This could be much
worse than the $O(\frac{1}{B} \log \frac{N}{M})$ bound obtained by the simple dynamic B-trees mentioned earlier, for
typical values of $N$, $M$, and $B$. Until today no bounds better than the ones in Table 1 are known. The
$O(\log \frac{N}{M} + \frac{K}{B})$ query and $O(\frac{1}{B} \log \frac{N}{M})$ insertion bounds seem to be an inherent
barrier that has been standing since 1996. Nobody can break one without sacrificing the other.

Lower bounds for this and related problems have also been sought. For lower bounds we will only consider
insertions; the results will also hold for the more general case where insertions and deletions are both present. A
closely related problem to range queries is the predecessor problem, in which the index stores a set of keys, and
the query asks for the preceding key for a query point. The predecessor problem has been extensively studied in
various internal memory models, and the bounds are now tight in almost all cases [6]. In external memory, Brodal
and Fagerberg [8] prove that for the dynamic predecessor problem, if insertions are handled in
$O(\frac{1}{B} \log \frac{N}{M})$ I/Os amortized, a predecessor query has to take
$\Omega(\log \frac{N}{M} / \log\log \frac{N}{M})$ I/Os in the worst case. Their lower bound model is a comparison
based external memory model. However, a closer look at their proof reveals that their techniques can actually be
adapted to prove the same lower bound of $\Omega(\log \frac{N}{M} / \log\log \frac{N}{M} + \frac{K}{B})$ for range
queries for any $B$ [...]. More precisely, we can use their techniques to get the following tradeoff: If an insertion
takes amortized $u/B$ I/Os and a query takes worst-case $q + O(\frac{K}{B})$ I/Os, then we have

[equation (1) is missing from the source]

provided $u < B/\log^{[\ldots]} N$ and $N > [\ldots]$. In addition to (1), a few other tradeoffs have also been
obtained in [8] for the predecessor problem, but their proofs cannot be made to work for range queries. For the most
interesting case when we require $q = O(\log \frac{N}{M})$, (1) gives a meaningless bound of
$u = \Omega(1/\log^{[\ldots]} [\ldots])$, as $u \ge 1$ trivially. In the other direction, if
$u = O(\log \frac{N}{M})$, the tradeoff (1) still leaves a $\Theta(\log\log \frac{N}{M})$ gap to the known upper bound
for $q$.

Our results. In this paper, we prove a query-insertion tradeoff of

$$\begin{cases} q \cdot \log(u/q) = \Omega(\log B), & \text{for } q < \alpha \ln B, \text{ where } \alpha \text{ is any constant;} \\ u \cdot \log q = \Omega(\log B), & \text{for all } q, \end{cases} \qquad (2)$$

for any dynamic range query index with a query cost of $q + O(K/B)$ and an amortized insertion cost of $u/B$,
provided $N \ge 2MB^2$. For most reasonable values of $N$, $M$, and $B$, we may assume that
$\frac{N}{M} = B^{O(1)}$, or equivalently that the B-tree built on $N$ keys has constant height. In this case if we
require $q = O(\log \frac{N}{M}) = O(\log B)$, the first branch of (2) gives $u = \Omega(\log B)$, matching the known
upper bounds in Table 1. In the other direction, if $u = O(\log \frac{N}{M}) = O(\log B)$, we have
$q = \Omega(\log B) = \Omega(\log \frac{N}{M})$, which is again tight, and closes the $\Theta(\log\log \frac{N}{M})$
gap left in [8]. In fact for any $2 \le \ell \le B$, if $u = O(\ell \log_\ell B)$, we have a tight lower bound
$q = \Omega(\log_\ell B)$, matching the bounds in the first row of Table 1. The second branch of (2) is relevant for
larger values of $q$, for which the previous tradeoff (1) gives no information. In particular, if
$u = O(\log_B \frac{N}{M}) = O(1)$, we have $q = B^{\Omega(1)}$. This means that if we want to support very fast
insertions, the query cost has to go from logarithmic to polynomial, an exponential blowup. This matches the second row
of Table 1. Our results show that all the indexes listed in Table 1, which are all quite simple and practical, are
already essentially the best one can hope for.
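
As a short worked step, the polynomial blowup claimed for the second branch of (2) can be unwound as follows. The constants $c, c' > 0$ are not in the original text; they are introduced here only to make the asymptotic notation explicit.

```latex
u \cdot \log q \ge c \log B
\quad\text{and}\quad
u \le c'
\;\Longrightarrow\;
\log q \ge \frac{c}{c'} \log B
\;\Longrightarrow\;
q \ge B^{c/c'} = B^{\Omega(1)}.
```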

More interestingly, our lower bounds hold in a dynamic version of the indexability model [11], which was originally 
proposed by Hellerstein, Koutsoupias, and Papadimitriou [12]. To date, nearly all the known lower bounds for indexing 
problems are proved in this model [3, 4, 11, 15, 18]. It is in some sense the strongest possible model for reporting
problems. It basically assumes that the query cost is only determined by the number of disk blocks that hold the actual 
query results, and ignores all the search cost that we need to pay to find these blocks. Consequently, lower bounds 
obtained in this model are also stronger than those obtained in other models. We will give more details on this model 
in Section 2. However, until today this model has been used exclusively for studying static indexing problems and 
only in two or higher dimensions. In one dimension, the model yields trivial bounds (see Section 2 for details). In 
the JACM article [11] that summarizes most of the results on indexability, the authors state: "However, our model 
also ignores the dynamic aspect of the problem, that is, the cost of insertion and deletion. Its consideration could be a 
source of added complexity, and in a more general model the source of more powerful lower bounds." In this respect, 
another contribution of this paper is to add dynamization to the model of indexability, making it more powerful and 
complete. In particular, our lower bound results suggest that, although static indexability is only effective in two or 
more dimensions, dynamization makes it a suitable model for one-dimensional indexing problems as well. 

2 Dynamic Indexability 

Static indexability. We first briefly review the framework of indexability before introducing its dynamization. We
follow the notations from [11]. A workload $W$ is a tuple $W = (D, I, Q)$ where $D$ is a possibly infinite set (the
domain), $I \subseteq D$ is a finite subset of $D$ (the instance), and $Q$ is a set of subsets of $I$ (the query set).
For example, for one-dimensional range queries, $D$ is the real line, $I$ is a set of points on the line, and $Q$
consists of all the contiguous subsets of $I$. We usually use $N = |I|$ to denote the number of objects in the
instance. An indexing scheme $S = (W, \mathcal{B})$ consists of a workload $W$ and a set $\mathcal{B}$ of $B$-subsets
of $I$ such that $\mathcal{B}$ covers $I$. The $B$-subsets of $\mathcal{B}$ model the data blocks of an index
structure, while any auxiliary structures connecting these data blocks (such as pointers, splitting elements) are
ignored in this framework. The size of the indexing scheme is $|\mathcal{B}|$, the number of blocks. In [11], an
equivalent parameter, the redundancy $r = B|\mathcal{B}|/N$, is used to measure the space complexity of the indexing
scheme. The cost of a query $q \in Q$ is the minimum number of blocks whose union covers $q$. Note that here we have
implicitly assumed that the query algorithm can find these blocks to cover $q$ instantly with no cost, essentially
ignoring the "search cost". The access overhead $A$ is the minimum $A$ such that any query $q \in Q$ has a cost at most
$A \cdot \lceil |q|/B \rceil$. Note that $\lceil |q|/B \rceil$ is the minimum number of blocks to report the objects
in $q$, so the access overhead $A$ measures how much more we need to access the blocks in order to retrieve $q$. For
some problems using a single parameter for the access overhead is not expressive enough, and we may split it into two:
one that depends on $|q|$ and another that does not. More precisely, an indexing scheme with access overhead
$(A_0, A_1)$ must answer any query $q \in Q$ with cost at most $A_0 + A_1 \cdot \lceil |q|/B \rceil$ [4]. We can see
that the indexability model is very strong. It is the strongest possible model that one can conceive for reporting
problems. It is generally accepted that no index structure could break indexability lower bounds, unless it somehow
"creates" objects without accessing the original ones or their copies.

Except for some trivial facts, all the lower bound results obtained under this model are expressed as a tradeoff
between $r$ and $A$ (or $(A_0, A_1)$). For example, two-dimensional range reporting has a tradeoff of
$r = \Omega(\log(N/B) / \log A)$ [3, 11]; for the point enclosure problem, the dual of range queries, we have the
tradeoff $A_0 = \Omega(\log(N/B) / \log r)$ [4]. These results show that, even if we ignore the search cost, we can
obtain nontrivial lower bounds for these problems. These lower bounds have also been matched with corresponding
indexes that do include the search cost for typical values of $r$ and $A$ [3, 4]. This means that the inherent
difficulty of these indexing problems is rooted in how we should lay out the data objects on disk, not in the search
structure on top of them. By ignoring the search component of an index, we obtain a simple and clean model, which is
still powerful enough to reveal the inherent complexity of indexing. It is worth noting that the indexability model is
very similar in spirit to the cell probe model of Yao [19], which has been successfully used to derive many internal
memory lower bounds. But the two models are also different in some fundamental ways; please see [11] for a discussion.

Nevertheless, although the indexability model is appropriate for two-dimensional problems, it seems to be overly
strong for the more basic one-dimensional range query problem. In one dimension, we could simply lay out all the
points in order sequentially on disk, which would give us a linear-size index with constant access overhead! This
breaks the $O(\log_B N)$ bound of the good old B-tree, and suggests that the indexability model may be too strong for
studying one-dimensional workloads. This in fact can be explained. The $\Omega(\log_B N)$ lower bound holds only in
some restrictive models, such as the comparison model, and the B-tree indeed only uses comparisons to guide its search.
As we mentioned in the introduction, if we are given more computational power (such as direct addressing), we can
actually solve the static 1D range query problem with an index of linear size and $O(\lceil K/B \rceil)$-I/O query
cost [1]. This means that the search cost for 1D range queries can still be ignored without changing the complexity of
the problem, and the indexability model is still appropriate, although it only gives a trivial lower bound.
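
The constant access overhead of the sequential layout can be checked by brute force. The sketch below assumes the points are identified with their ranks $0, \ldots, N-1$ and that block $k$ stores ranks $kB, \ldots, kB + B - 1$. The function name `sequential_access_overhead` is hypothetical.

```python
from math import ceil


def sequential_access_overhead(N, B):
    """Access overhead of storing N sorted points in consecutive blocks of B.

    Because the blocks partition the ranks, the cheapest cover of a contiguous
    query is exactly the set of blocks it touches.
    """
    worst = 0.0
    for i in range(N):              # rank of the first point in the query
        for j in range(i, N):       # rank of the last point (inclusive)
            touched = j // B - i // B + 1
            needed = ceil((j - i + 1) / B)   # the unavoidable ceil(|q| / B)
            worst = max(worst, touched / needed)
    return worst
```

For any $B \ge 2$ the worst case is a query of two points that straddles a block boundary, so the overhead is 2: a query of size $k$ touches at most one block more than $\lceil k/B \rceil$.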

Dynamic indexability. In the dynamic case, the domain $D$ remains static, but the instance set $I$ could change.
Correspondingly, the query set $Q$ changes and the index also updates its blocks $\mathcal{B}$ to cope with the changes
in $I$. In the static model, there is no component to model the main memory, which is all right since the memory does
not help reduce the worst-case query cost anyway. However, in the dynamic case, the main memory does improve
the (amortized) update cost significantly by buffering the recent updates. So we have to include a main memory
component in the indexing scheme. More precisely, the workload $W$ is defined as before, but an indexing scheme is
now defined as $S = (W, \mathcal{B}, \mathcal{M})$ where $\mathcal{M}$ is a subset of $I$ with size at most $M$ such
that the blocks of $\mathcal{B}$ together with $\mathcal{M}$ cover $I$. The redundancy $r$ is defined as before, but
the access overhead $A$ is now defined as the minimum $A$ such that any $q \in Q$ can be covered by $\mathcal{M}$ and
at most $A \cdot \lceil |q|/B \rceil$ blocks from $\mathcal{B}$.

We now define the dynamic indexing scheme. Here we only consider insertions; deletions can be incorporated 
similarly. We first define the dynamic workload. 

Definition 1 A dynamic workload $\mathbb{W}$ is a sequence of $N$ workloads $W_1 = (D, I_1, Q_1), \ldots,
W_N = (D, I_N, Q_N)$ such that $|I_i| = i$ and $I_i \subset I_{i+1}$ for $i = 1, \ldots, N - 1$.

Essentially, we insert N objects into / one by one, resulting in a sequence of workloads. Meanwhile, the query set Q 
changes according to the problem at hand. 

Definition 2 For a given dynamic workload $\mathbb{W} = (W_1, \ldots, W_N)$, a dynamic indexing scheme $\mathbb{S}$ is
a sequence of $N$ indexing schemes $S_1 = (W_1, \mathcal{B}_1, \mathcal{M}_1), \ldots,
S_N = (W_N, \mathcal{B}_N, \mathcal{M}_N)$. Each $S_i$ is called a snapshot of $\mathbb{S}$. $\mathbb{S}$ has
redundancy $r$ and access overhead $A$ if for all $i$, $S_i$ has redundancy at most $r$ and access overhead at most $A$.

A third parameter u, the update cost, is defined as follows. 

Definition 3 Given a dynamic indexing scheme $\mathbb{S}$, the transition cost from $S_i$ to $S_{i+1}$ is
$|\mathcal{B}_i - \mathcal{B}_{i+1}| + |\mathcal{B}_{i+1} - \mathcal{B}_i|$, i.e., the number of blocks that are
different in $\mathcal{B}_i$ and $\mathcal{B}_{i+1}$. The update cost of $\mathbb{S}$ is the $u$ such that the sum of
all the transition costs for all $1 \le i \le N - 1$ is $u \cdot N/B$.

Note that the update cost as defined above is the amortized cost for handling $B$ updates. This is mainly for
convenience so that $u$ is always at least 1.
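
Definition 3 can be made concrete with a small sketch that looks only at the block sets $\mathcal{B}_i$ and ignores the memory component. Each snapshot is assumed to be a Python `set` of `frozenset` blocks. The function names are hypothetical.

```python
def transition_cost(blocks_before, blocks_after):
    """|B_i - B_{i+1}| + |B_{i+1} - B_i|: blocks present in only one snapshot."""
    return len(blocks_before ^ blocks_after)


def update_cost(snapshots, B):
    """The u for which the summed transition costs equal u * N / B,
    where N = len(snapshots) is the number of inserted objects."""
    N = len(snapshots)
    total = sum(transition_cost(a, b) for a, b in zip(snapshots, snapshots[1:]))
    return total * B / N
```

For example, with $B = 2$ and the four snapshots $\{\{1\}\}$, $\{\{1,2\}\}$, $\{\{1,2\},\{3\}\}$, $\{\{1,2\},\{3,4\}\}$, the transition costs are 2, 1, and 2, so $u = 5 \cdot 2 / 4 = 2.5$. Rewriting a block in place counts twice, once for the old version and once for the new one.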

Our definition of the dynamic indexability model continues the same spirit as in the static case: We will only focus 
on the cost associated with the changes in the blocks holding the actual data objects, while ignoring the search cost of 
how to find these blocks to be changed. Under this framework, the main result obtained in this paper is the following 
tradeoff between u and A. 

Theorem 1 Let $\mathbb{S}$ be any dynamic indexing scheme for dynamic one-dimensional range queries with access
overhead $A$ and update cost $u$. Provided $N \ge 2MB^2$, we have

$$\begin{cases} A \cdot \log(u/A) = \Omega(\log B), & \text{for } A < \alpha \ln B, \text{ where } \alpha \text{ is any constant;} \\ u \cdot \log A = \Omega(\log B), & \text{for all } A. \end{cases}$$

Note that this lower bound does not depend on the redundancy $r$, meaning that the index cannot do better by consuming
more space. Interestingly, our result shows that although the indexability model is basically meaningless for static 1D
range queries, it gives nontrivial and almost tight lower bounds when dynamization is considered.

To prove Theorem 1, below we first define a ball-shuffling problem and show that any dynamic indexing scheme
for 1D range queries yields a solution to the ball-shuffling problem. Then we prove a lower bound for the latter.

3 The Ball-Shuffling Problem and the Reduction 

We now define the ball-shuffling problem, and present a lower bound for it. There are $n$ balls and $t$ bins,
$b_1, \ldots, b_t$. The balls come one by one. Upon the arrival of each ball, we need to find some bin $b_i$ to put it
in. Abusing notation, we also use $b_i$ to denote the current size of the bin, i.e., the number of balls inside. The
cost of putting the ball into $b_i$ is defined to be $b_i + 1$. Instead of directly putting a ball into a bin, we can
do so with shuffling: We first collect all the balls from one or more bins, add the new ball to the collection, and
then arbitrarily allocate these balls into a number of empty bins. The cost of this operation is the total number of
balls involved, i.e., if $I$ denotes the set of indices of the bins collected, the cost is $\sum_{i \in I} b_i + 1$.
Note that directly putting a ball into a bin can be seen as a special shuffle, where we collect balls from only one bin
and allocate the balls back to one bin.
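
The cost accounting of the ball-shuffling problem can be sketched as a small bookkeeping class. Only bin sizes are tracked, and the names `BallShuffling`, `shuffle`, and `put` are hypothetical, chosen for this illustration.

```python
class BallShuffling:
    """Bookkeeping for the ball-shuffling game with t bins (sizes only)."""

    def __init__(self, t):
        self.bins = [0] * t   # current size of each bin
        self.cost = 0         # total cost paid so far

    def shuffle(self, collect, allocate):
        """Handle one arriving ball: empty the bins indexed by `collect`, add
        the new ball, and place the balls as `allocate` (bin index -> count)."""
        collected = set(collect)
        total = sum(self.bins[i] for i in collected) + 1
        if sum(allocate.values()) != total:
            raise ValueError("every collected ball and the new ball must be placed")
        for i in allocate:
            if i not in collected and self.bins[i] != 0:
                raise ValueError("balls may only be allocated to empty bins")
        for i in collected:
            self.bins[i] = 0
        for i, count in allocate.items():
            self.bins[i] = count
        self.cost += total    # cost = number of balls involved

    def put(self, i):
        """Directly put the new ball into bin i, at cost b_i + 1."""
        self.shuffle([i], {i: self.bins[i] + 1})
```

With a single bin, five calls to `put(0)` cost $1 + 2 + 3 + 4 + 5 = 15$. With three bins, putting one ball in each of two bins and then merging both into the third with the next ball costs $1 + 1 + 3 = 5$.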

Our main result for the ball-shuffling problem is the following lower bound, whose proof is deferred to Section 4. 

Theorem 2 The cost of any algorithm for the ball-shuffling problem is at least (i) $\Omega(n \log_t n)$ for any $t$;
and (ii) $\Omega(t n^{1+\Omega(1/t)})$ for $t < \alpha \ln n$ where $\alpha$ is an arbitrary constant.

The reduction. Suppose there is a dynamic indexing scheme $\mathbb{S} = (S_1, \ldots, S_N)$ for dynamic
one-dimensional range queries with update cost $u$ and access overhead $A$. Assuming $N \ge 2MB^2$, we will show how
this leads to a solution to the ball-shuffling problem on $n = B$ balls and $t = A$ bins with cost $O(uB)$. This will
immediately translate the tradeoff in Theorem 2 to the desired tradeoff in Theorem 1.
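
The translation from Theorem 2 to Theorem 1 can be spelled out by substituting $n = B$, $t = A$, and a total cost of $O(uB)$ into the two parts of Theorem 2. This is a sketch of the arithmetic only, with hidden constants suppressed.

```latex
\text{(i)}\quad uB = \Omega(B \log_A B)
\;\Longrightarrow\; u = \Omega\!\left(\frac{\log B}{\log A}\right)
\;\Longrightarrow\; u \cdot \log A = \Omega(\log B);

\text{(ii)}\quad uB = \Omega\!\left(A \cdot B^{1+\Omega(1/A)}\right)
\;\Longrightarrow\; \frac{u}{A} = B^{\Omega(1/A)}
\;\Longrightarrow\; A \cdot \log(u/A) = \Omega(\log B).
```

Part (ii) requires $t < \alpha \ln n$, which becomes the condition $A < \alpha \ln B$ in the first branch of Theorem 1.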

We divide these $N$ points into subsets of $2MB^2$ points each. We will use a separate construction for each subset of
points. Since the amortized cost for handling every $B$ insertions of points is $u$, at least one of the subsets has a
total transition cost of at most $O(uMB)$. Let us consider one such subset of $N' = 2MB^2$ points.

We construct a dynamic workload of $N'$ points as follows. The points are divided into $2MB$ groups of $B$ each.
The coordinates of all points in the $j$-th group are in the range of $(j, j+1)$ and distinct. We perform the
insertions in $B$ rounds; in each round, we simply add one point to each group. The dynamic indexing scheme
$\mathbb{S}$ correspondingly has $N'$ snapshots $S_1 = (W_1, \mathcal{B}_1, \mathcal{M}_1), \ldots,
S_{N'} = (W_{N'}, \mathcal{B}_{N'}, \mathcal{M}_{N'})$. We will only consider the subsequence $\mathbb{S}'$
consisting of the snapshots $S_{2MB}, S_{2 \cdot 2MB}, \ldots, S_{N'}$, i.e., the ones after every round. The total
transition cost of this subsequence is obviously no higher than that of the entire sequence. Recall that the
transition cost from a snapshot $S = (W, \mathcal{B}, \mathcal{M})$ to its succeeding snapshot
$S' = (W', \mathcal{B}', \mathcal{M}')$ is the number of blocks that are different in $\mathcal{B}$ and
$\mathcal{B}'$. We now define the element transition cost to be the number of elements in these different blocks, more
precisely, $|\{x \mid x \in b, b \in (\mathcal{B} - \mathcal{B}') \cup (\mathcal{B}' - \mathcal{B})\}|$. Since each
block contains at most $B$ elements, the element transition cost is at most a factor $O(B)$ larger than the transition
cost. Thus, $\mathbb{S}'$ has an element transition cost of $O(uMB^2)$. The element transition cost can be associated
with the elements involved, that is, it is the total number of times that an element has been in an updated block,
summed over all elements.

If a group $G$ has at least one point in some $\mathcal{M}_i$ in $\mathbb{S}'$, then it is said to be contaminated.
Since $\sum_{i=1}^{B} |\mathcal{M}_{i \cdot 2MB}| \le MB$, at most $MB$ groups are contaminated. Since the total
element transition cost of $\mathbb{S}'$ is $O(uMB^2)$, among the at least $MB$ uncontaminated groups, at least one
has an element transition cost of $O(uB)$. Focus on such a group, and let $G_1, \ldots, G_B$ be the snapshots of this
group after every round. Since this group is uncontaminated, all points in $G_i$ must be completely covered by
$\mathcal{B}_{i \cdot 2MB}$ for all $i = 1, \ldots, B$. Since $G_i$ has at most $B$ points and $\mathbb{S}$ has access
overhead $A$, $G_i$ should always be covered by at most $A$ blocks in $\mathcal{B}_{i \cdot 2MB}$. For each $i$, let
$b_{i,1}, \ldots, b_{i,A}$ be the blocks of $\mathcal{B}_{i \cdot 2MB}$ that cover $G_i$, and let
$\hat{b}_{i,j} = b_{i,j} \cap G_i$, $j = 1, \ldots, A$. Note that these $\hat{b}_{i,j}$ may overlap and some of them
may be empty. Let $\hat{\mathcal{B}}_i = \{\hat{b}_{i,1}, \ldots, \hat{b}_{i,A}\}$. Consider the transition from
$\hat{\mathcal{B}}_i$ to $\hat{\mathcal{B}}_{i+1}$. We can as before define its element transition cost as
$|\{x \mid x \in \hat{b}, \hat{b} \in (\hat{\mathcal{B}}_i - \hat{\mathcal{B}}_{i+1}) \cup (\hat{\mathcal{B}}_{i+1} - \hat{\mathcal{B}}_i)\}|$.
This element transition cost cannot be higher than that from $\mathcal{B}_{i \cdot 2MB}$ to
$\mathcal{B}_{(i+1) \cdot 2MB}$ only counting the elements of $G_{i+1}$, because
$\hat{b}_{i,j} \ne \hat{b}_{i+1,j'}$ only if $b_{i,j} \ne b_{i+1,j'}$. Therefore, the total element transition cost of
the sequence $\hat{\mathcal{B}}_1, \ldots, \hat{\mathcal{B}}_B$ is at most $O(uB)$.

Now we claim that the sequence $\hat{\mathcal{B}}_1, \ldots, \hat{\mathcal{B}}_B$ gives us a solution for the
ball-shuffling problem of $B$ balls and $A$ bins with cost at most its element transition cost. To see this, just
treat each set in $\hat{\mathcal{B}}_i$ as a bin in the ball-shuffling problem. To add the $(i+1)$-th ball, we shuffle
the bins in $\hat{\mathcal{B}}_i - \hat{\mathcal{B}}_{i+1}$ and allocate the balls according to the sizes of the sets
in $\hat{\mathcal{B}}_{i+1} - \hat{\mathcal{B}}_i$. An element may have copies in $\hat{\mathcal{B}}_{i+1}$, so there
could be more elements than balls in $\hat{\mathcal{B}}_{i+1} - \hat{\mathcal{B}}_i$. But this is all right: we can
still allocate balls according to $\hat{\mathcal{B}}_{i+1} - \hat{\mathcal{B}}_i$, while just making sure that each
bin has no more balls than its corresponding set in $\hat{\mathcal{B}}_{i+1}$. This way, we can ensure that the cost
of each shuffle is always no more than the element transition cost of each transition. Therefore, we obtain a solution
to the ball-shuffling problem with cost $O(uB)$. This completes the reduction.

4 Proof of Theorem 2 

Proof of part (i). We first prove part (i) of the theorem. We will take an indirect approach, proving that any
algorithm that handles the balls with an average cost of $u$ using $t$ bins cannot accommodate $(2t)^{2u}$ balls or
more. This means that $n < (2t)^{2u}$, or $u > \frac{\log n}{2 \log(2t)}$, so the total cost of the algorithm is
$un = \Omega(n \log_t n)$.

We prove so by induction on $u$. When $u = 1$, clearly the algorithm has to put every ball into an empty bin, so with
$t$ bins, the algorithm can handle at most $t < (2t)^2$ balls. We will use a step size of $\frac{1}{2}$ for the
induction, i.e., we will assume that the claim is true for $u$, and show that it is also true for $u + \frac{1}{2}$.
(Thus our proof works for any $u$ that is a multiple of $\frac{1}{2}$; for other values of $u$, the lower bound
becomes $(2t)^{\lceil 2u \rceil}$, which does not affect our asymptotic result.) Equivalently we need to show that to
handle $(2t)^{2u+1}$ balls, any algorithm using $t$ bins has to pay an average cost of more than $u + \frac{1}{2}$ per
ball, or $(u + \frac{1}{2})(2t)^{2u+1} = (2tu + t)(2t)^{2u}$ in total. We divide the $(2t)^{2u+1}$ balls into $2t$
batches of $(2t)^{2u}$ each. By the induction hypothesis, to handle the first batch, the algorithm has to pay a total
cost of more than $u(2t)^{2u}$. For each of the remaining batches, the cost is also more than $u(2t)^{2u}$, plus the
cost of shuffling the existing balls from previous batches. This amounts to a total cost of $2tu(2t)^{2u}$, and we only
need to show that shuffling the balls from previous batches costs at least $t(2t)^{2u}$ in total.

If a batch has at least one ball that is never shuffled in later batches, it is said to be a bad batch, otherwise it
is a good batch. The claim is that at most $t$ of these $2t$ batches are bad. Indeed, since each bad batch has at least
one ball that is never shuffled later, the bin that this ball resides in cannot be touched any more. So each bad batch
takes away at least one bin from later batches and there are only $t$ bins. Therefore there are at least $t$ good
batches, in each of which all the $(2t)^{2u}$ balls have been shuffled later. This costs at least $t(2t)^{2u}$, and the
proof is complete.

The merging lemma. Part (i) of the theorem is very loose for small values of $t$. If $t < \alpha \log n$ where
$\alpha$ is an arbitrary constant, we can prove a much higher lower bound, which later will lead to the most
interesting branch in the query-update tradeoff (2) of range queries. The rest of this section is devoted to the proof
of part (ii) of Theorem 2, and it requires a much more careful and direct analysis.

We first prove the following lemma, which restricts how an optimal algorithm may shuffle. We call a shuffle that
allocates balls back to more than one bin a splitting shuffle, otherwise it is a merging shuffle.

Lemma 1 There is an optimal algorithm that only uses merging shuffles. 

Proof: For a shuffle, we call the number of bins that receive balls from the shuffle its splitting number. A splitting
shuffle has a splitting number at least 2, and a merging shuffle's splitting number is 1. For an algorithm
$\mathcal{A}$, let $\pi(\mathcal{A})$ be the sequence of the splitting numbers of all the $n$ shuffles performed by
$\mathcal{A}$. Below we will show how to transform $\mathcal{A}$ into another algorithm $\mathcal{A}'$ whose cost is no
higher than that of $\mathcal{A}$, while $\pi(\mathcal{A}')$ is lexicographically smaller than $\pi(\mathcal{A})$.
Since every splitting number is between 1 and $t$, after a finite number of such transformations, we will arrive at an
algorithm whose splitting numbers are all 1, hence proving the lemma.

Let $\mathcal{A}$ be an algorithm that uses at least one splitting shuffle, and consider the last splitting shuffle
carried out by $\mathcal{A}$. Suppose it allocates balls to $k$ bins. $\mathcal{A}'$ will do the same as $\mathcal{A}$
up until its last splitting shuffle, which $\mathcal{A}'$ will change to the following shuffle. $\mathcal{A}'$ will
collect balls from the same bins but will only allocate them to $k - 1$ bins. Among the $k - 1$ bins, $k - 2$ of them
receive the same number of balls as in $\mathcal{A}$, while the last bin receives all the balls in the last two bins
used in $\mathcal{A}$. Observe that since the bins are indistinguishable, the current status of the bins is only
determined by their sizes. So the only difference between $\mathcal{A}$ and $\mathcal{A}'$ after this shuffle is two
bins, say $b_1, b_2$ of $\mathcal{A}$ and $b'_1, b'_2$ of $\mathcal{A}'$. Note that the cost of this shuffle is the
same for both $\mathcal{A}$ and $\mathcal{A}'$. After this shuffle, suppose we have $b_1 = x$, $b_2 = y$,
$b'_1 = x + y$, $b'_2 = 0$ for some $x, y \ge 1$. Clearly, no matter what $\mathcal{A}'$ does in the future, we always
have $\pi(\mathcal{A}')$ lexicographically smaller than $\pi(\mathcal{A})$.

From now on $\mathcal{A}'$ will mimic what $\mathcal{A}$ does with no higher cost. We will look ahead at the
operations that $\mathcal{A}$ does with $b_1$ and $b_2$, and decide the corresponding actions of $\mathcal{A}'$. Note
that $\mathcal{A}$ will do no more splitting shuffles. Consider all the shuffles that $\mathcal{A}$ does until it
merges $b_1$ and $b_2$ together, or until the end if $\mathcal{A}$ never does so. For those shuffles that touch
neither $b_1$ nor $b_2$, $\mathcal{A}'$ will simply do the same. Each of the rest of the shuffles involves $b_1$ but
not $b_2$ (resp. $b_2$ but not $b_1$). Since the bins are indistinguishable, for any such merging shuffle, we may
assume that all the balls are put back to $b_1$ (resp. $b_2$). Suppose there are $a_1$ shuffles involving $b_1$ and
$a_2$ shuffles involving $b_2$. Assume for now that $a_1 \le a_2$. $\mathcal{A}'$ will do the following
correspondingly. When $\mathcal{A}$ touches $b_1$, $\mathcal{A}'$ will use $b'_1$; and when $\mathcal{A}$ touches
$b_2$, $\mathcal{A}'$ will use $b'_2$. Clearly, for any shuffle that involves neither $b_1$ nor $b_2$, the cost is the
same for $\mathcal{A}$ and $\mathcal{A}'$. For a shuffle that involves $b_1$ but not $b_2$, since before $\mathcal{A}$
merges $b_1$ and $b_2$ we have the invariant that $b'_1 = b_1 + y$, $\mathcal{A}'$ pays a cost of $y$ more than that
of $\mathcal{A}$, for each of these $a_1$ shuffles. For a shuffle that involves $b_2$ but not $b_1$, since we have the
invariant that $b'_2 = b_2 - y$, $\mathcal{A}'$ pays a cost of $y$ less than that of $\mathcal{A}$, for each of these
$a_2$ shuffles. So $\mathcal{A}'$ incurs a total cost no more than that of $\mathcal{A}$. In the case $a_1 > a_2$,
when $\mathcal{A}$ touches $b_1$, $\mathcal{A}'$ will use $b'_2$; and when $\mathcal{A}$ touches $b_2$, $\mathcal{A}'$
will use $b'_1$. A similar argument then goes through. Finally, when $\mathcal{A}$ merges $b_1$ and $b_2$ together (if
it ever does so), $\mathcal{A}'$ will also shuffle both $b'_1$ and $b'_2$. Since we always have
$b_1 + b_2 = b'_1 + b'_2$, the cost of this shuffle is the same for $\mathcal{A}$ and $\mathcal{A}'$. After this
shuffle, $\mathcal{A}$ and $\mathcal{A}'$ are in the same status. Thus we have transformed $\mathcal{A}$ into
$\mathcal{A}'$ with no higher cost while $\pi(\mathcal{A}')$ is strictly lexicographically smaller than
$\pi(\mathcal{A})$. Applying such transformations iteratively proves the lemma. □

The recurrence. Now we are ready to prove part (ii) of Theorem 2. Our general approach is by induction on $t$. Let
$f_t(n)$ be the minimum cost of any algorithm for the ball-shuffling problem with $n$ balls and $t$ bins. Let $\alpha$ be an arbitrary
constant. The induction process consists of two phases. In the first phase, we prove that $f_t(n) \ge c_1 t n^{1+c_2/t} - 2tn$ for
all $t \le t_0 = \lfloor c_0 \ln n \rfloor$, where $c_0$, $c_1$ and $c_2$ are some small constants to be determined later. In phase two, we prove
that $f_t(n) \ge c_1 t_0 n^{1+c_2/(t_0 + c_0(t-t_0)/\alpha)} - 2tn$ for all $t_0 \le t \le \alpha \ln n$. Finally we show how to choose the constants
$c_0, c_1, c_2$ such that $f_t(n)$ is always at least $\Omega(t n^{1+\Omega(1/t)})$.

The base case of the first phase, $t = 1$, is easily established, since the optimal algorithm is simply adding the balls
to the only bin one by one, yielding $f_1(n) = \frac{1}{2} n(n+1) \ge c_1 n^{1+c_2} - 2n$, provided that we choose $c_1 \le 1/2$, $c_2 \le 1$.

By Lemma 1, there is an optimal algorithm $A$ for shuffling $n$ balls with $t + 1$ bins where $A$ only uses merging
shuffles. Since the bins are indistinguishable, we may assume w.l.o.g. that there is a designated bin, say $b_1$, such that
whenever $b_1$ is shuffled, all the balls are put back to $b_1$. Suppose when handling the last ball, we force $A$ to shuffle all
the balls to $b_1$, which costs $n$. We will later subtract this cost since $A$ may not actually do so in the last step.

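Suppose $A$ carries out a total of $k$ shuffles involving $b_1$ (including the last enforced shuffle), and with the $i$-th
shuffle, $b_1$ increases by $x_i \ge 1$. It is clear that $\sum_{i=1}^{k} x_i = n$. We claim that the total cost of $A$, $f_{t+1}(n)$, is at least

$$f_t(x_1) + f_t(x_2) + \cdots + f_t(x_k) + \left(k - \frac{1}{t}\right) x_1 + \left(k - 1 - \frac{1}{t}\right) x_2 + \cdots + \left(1 - \frac{1}{t}\right) x_k - 2n. \qquad (3)$$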

Consider the $i$-th shuffle involving $b_1$. This shuffle brings $x_i$ balls to $b_1$, including the new ball just added in this
step. Let us lower bound the cost due to these $x_i$ balls. First, those $x_i - 1$ old balls must not have been in $b_1$ before,
since whenever $A$ shuffles $b_1$, all the balls will go back to $b_1$. So $A$ must have been able to accommodate them using
the other $t$ bins. This costs at least $f_t(x_i - 1)$, even if ignoring the cost of shuffling the other existing balls in these $t$
bins. Then these $x_i - 1$ balls, plus a new ball, are shuffled to $b_1$. This costs $x_i$, not counting the cost associated with
the existing balls in $b_1$. Finally, these $x_i$ balls will be in $b_1$ for all of the remaining $k - i$ shuffles involving $b_1$, costing
$(k - i) x_i$. Thus, we can charge a total cost of

$$f_t(x_i - 1) + x_i + (k - i) x_i = f_t(x_i - 1) + 1 + \frac{x_i}{t} + \left(k - i + 1 - \frac{1}{t}\right) x_i - 1 \ge f_t(x_i) + \left(k - i + 1 - \frac{1}{t}\right) x_i - 1 \qquad (4)$$

to these $x_i$ balls. That $f_t(x_i - 1) + 1 + x_i/t \ge f_t(x_i)$ easily follows from the observation that, to handle $x_i$ balls with $t$
bins, we can always run the optimal algorithm for $x_i - 1$ balls with $t$ bins, and then put the last ball into the smallest bin,
which will cost no more than $1 + (x_i - 1)/t \le 1 + x_i/t$. Finally, summing (4) over all $i$, relaxing $-k$ to $-n$, and
subtracting the cost of the enforced shuffle proves that (3) is a lower bound on $f_{t+1}(n)$ for given $k, x_1, \ldots, x_k$. Thus,
$f_{t+1}(n)$ is lower bounded by the minimum of (3), over all possible values of $k, x_1, \ldots, x_k$, subject to $\sum_i x_i = n$.

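We first use this recurrence to solve for $f_2(n)$:

$$\begin{aligned}
f_2(n) &\ge \min_{k,\ x_1+\cdots+x_k=n} \left\{ f_1(x_1) + \cdots + f_1(x_k) + (k-1) x_1 + \cdots + x_{k-1} - 2n \right\} \\
&= \min_{k,\ x_1+\cdots+x_k=n} \left\{ \tfrac{1}{2} x_1(x_1+1) + \cdots + \tfrac{1}{2} x_k(x_k+1) + (k-1) x_1 + \cdots + x_{k-1} - 2n \right\} \\
&\ge \min_{k} \left\{ \tfrac{1}{2} k \left(\tfrac{n}{k}\right)^2 + \tfrac{1}{2}(k-1)k - 2n \right\} \ge \tfrac{1}{4} n^{4/3} - 2n.
\end{aligned}$$

So if we choose $c_1 \le 1/4$, $c_2 \le 2/3$, we have $f_t(n) \ge c_1 t n^{1+c_2/t} - 2tn$ for $t = 2$.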
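The relaxed minimization over $k$ in the $f_2$ derivation is easy to evaluate numerically. The following is a minimal sketch, not part of the original proof: the function name `relaxed_f2_bound` is hypothetical, and it evaluates only the last relaxation, $\min_k \{\frac{1}{2} k (n/k)^2 + \frac{1}{2}(k-1)k\} - 2n$, so that it can be compared against a closed form of the shape $\frac{1}{4} n^{4/3} - 2n$.

```python
def relaxed_f2_bound(n):
    """Evaluate min over k of (1/2) k (n/k)^2 + (1/2)(k-1)k, minus 2n.

    This is the relaxed expression from the f_2 derivation, where the
    x_i have been replaced by their average n/k (the convex part) and by
    their minimum value 1 (the linear part). It is not f_2(n) itself.
    """
    best = min(0.5 * k * (n / k) ** 2 + 0.5 * (k - 1) * k
               for k in range(1, n + 1))
    return best - 2 * n


if __name__ == "__main__":
    for n in (8, 100, 1000):
        closed_form = 0.25 * n ** (4 / 3) - 2 * n
        print(n, relaxed_f2_bound(n), closed_form)
```

The minimizing $k$ is on the order of $n^{2/3}$, which is where the exponent $4/3$ comes from.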
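For $t \ge 2$, we relax the recurrence as

$$\begin{aligned}
f_{t+1}(n) &\ge \min_{k,\ x_1+\cdots+x_k=n} \left\{ f_t(x_1) + \cdots + f_t(x_k) + \left(k - \tfrac{1}{2}\right) x_1 + \left(k - 1 - \tfrac{1}{2}\right) x_2 + \cdots + \tfrac{1}{2} x_k - 2n \right\} \\
&\ge \min_{k,\ x_1+\cdots+x_k=n} \left\{ f_t(x_1) + \cdots + f_t(x_k) + \tfrac{1}{2}\left(k x_1 + (k-1) x_2 + \cdots + x_k\right) - 2n \right\}. \qquad (5)
\end{aligned}$$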



The induction, phase one. In phase one, we have $1 \le t \le t_0 - 1$ for $t_0 = \lfloor c_0 \ln n \rfloor$. The base cases $t = 1, 2$
have already been established. Assuming the induction hypothesis $f_t(n) \ge c_1 t n^{1+c_2/t} - 2tn$, we need to show
$f_{t+1}(n) \ge c_1 (t+1) n^{1+c_2/(t+1)} - 2(t+1)n$. From (5) we have

$$f_{t+1}(n) \ge \min_{k,\ x_1+\cdots+x_k=n} \left\{ c_1 t x_1^{1+c_2/t} - 2t x_1 + \cdots + c_1 t x_k^{1+c_2/t} - 2t x_k + \tfrac{1}{2}(k x_1 + \cdots + x_k) - 2n \right\}. \qquad (6)$$

Let $g_k(n)$ be the minimum of (6) for a given $k$. Then clearly $f_{t+1}(n) \ge \min_{1 \le k \le n} g_k(n)$, and we will show that

$$g_k(n) \ge c_1 (t+1) n^{1+c_2/(t+1)} - 2(t+1)n \qquad (7)$$

for all $k$, hence completing the induction.
We prove so using another level of induction on $k$. For the base case $k = 1$, we have $g_1(n) = c_1 t n^{1+c_2/t} - 2tn +
\frac{1}{2} n - 2n \ge c_1 t n^{1+c_2/t} - 2(t+1)n$, and $c_1 t n^{1+c_2/t} \ge c_1 (t+1) n^{1+c_2/(t+1)}$ holds as long as

$$t n^{\frac{c_2}{t}} \ge (t+1) n^{\frac{c_2}{t+1}} \iff n^{\frac{c_2}{t(t+1)}} \ge 1 + \frac{1}{t} \iff n^{\frac{c_2}{t+1}} \ge \left(1 + \frac{1}{t}\right)^{t} \Longleftarrow n^{\frac{c_2}{t+1}} \ge e \iff t \le c_2 \ln n - 1.$$

So if we choose $c_0 < c_2$, then for the range of $t$ that we consider in phase one, (7) holds for $k = 1$.
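The chain of implications for the base case $k = 1$ can be spot-checked numerically. The sketch below is illustrative only: the helper names `base_case_holds` and `max_t_from_chain` are hypothetical, and the values of $n$ and $c_2$ are arbitrary choices for the demonstration, not constants fixed by the proof at this point.

```python
import math


def base_case_holds(n, t, c2):
    """Check the inequality t * n^(c2/t) >= (t+1) * n^(c2/(t+1))."""
    return t * n ** (c2 / t) >= (t + 1) * n ** (c2 / (t + 1))


def max_t_from_chain(n, c2):
    """Largest integer t satisfying the sufficient condition t <= c2*ln(n) - 1."""
    return math.floor(c2 * math.log(n) - 1)


if __name__ == "__main__":
    n, c2 = 10 ** 18, 0.25  # assumed sample values
    t_max = max_t_from_chain(n, c2)
    for t in range(1, t_max + 1):
        print(t, base_case_holds(n, t, c2))
```

Every $t$ up to `max_t_from_chain(n, c2)` should satisfy the inequality, since the condition $t \le c_2 \ln n - 1$ is sufficient (though not necessary) for it.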

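Next, assuming that (7) holds for $k$, we will show $g_{k+1}(n) \ge c_1 (t+1) n^{1+c_2/(t+1)} - 2(t+1)n$. By definition,

$$\begin{aligned}
g_{k+1}(n) &= \min_{x_1+\cdots+x_{k+1}=n} \left\{ c_1 t x_1^{1+c_2/t} - 2t x_1 + \cdots + c_1 t x_{k+1}^{1+c_2/t} - 2t x_{k+1} + \tfrac{1}{2}\left((k+1) x_1 + \cdots + x_{k+1}\right) - 2n \right\} \\
&= \min_{x_{k+1}} \Big\{ c_1 t x_{k+1}^{1+c_2/t} - 2t x_{k+1} + \tfrac{1}{2} n + \min_{x_1+\cdots+x_k=n-x_{k+1}} \big\{ c_1 t x_1^{1+c_2/t} - 2t x_1 + \cdots + c_1 t x_k^{1+c_2/t} - 2t x_k \\
&\qquad\qquad + \tfrac{1}{2}(k x_1 + \cdots + x_k) - 2(n - x_{k+1}) \big\} - 2 x_{k+1} \Big\} \\
&= \min_{x_{k+1}} \left\{ c_1 t x_{k+1}^{1+c_2/t} - 2(t+1) x_{k+1} + \tfrac{1}{2} n + g_k(n - x_{k+1}) \right\} \\
&\ge \min_{x_{k+1}} \left\{ c_1 t x_{k+1}^{1+c_2/t} - 2(t+1) x_{k+1} + \tfrac{1}{2} n + c_1 (t+1)(n - x_{k+1})^{1+c_2/(t+1)} - 2(t+1)(n - x_{k+1}) \right\} \\
&= \min_{x_{k+1}} \left\{ c_1 t x_{k+1}^{1+c_2/t} + \tfrac{1}{2} n + c_1 (t+1)(n - x_{k+1})^{1+c_2/(t+1)} - 2(t+1) n \right\}.
\end{aligned}$$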



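Setting $x_{k+1} = \lambda n$ where $0 < \lambda < 1$, we will show

$$c_1 t (\lambda n)^{1+c_2/t} + c_1 (t+1) ((1-\lambda) n)^{1+c_2/(t+1)} + \tfrac{1}{2} n \ge c_1 (t+1) n^{1+c_2/(t+1)} \qquad (8)$$

for all $\lambda$. (8) is equivalent to

$$\frac{t}{t+1} n^{\frac{c_2}{t(t+1)}} \lambda^{1+\frac{c_2}{t}} + (1-\lambda)^{1+\frac{c_2}{t+1}} + \frac{1}{2 c_1 (t+1) n^{c_2/(t+1)}} \ge 1. \qquad (9)$$

Since $(1-\lambda)^{1+\frac{c_2}{t+1}} \ge (1-\lambda)^{1+\frac{c_2}{t}}$, to prove (9), it suffices to prove

$$\frac{t}{t+1} n^{\frac{c_2}{t(t+1)}} \lambda^{1+\frac{c_2}{t}} + (1-\lambda)^{1+\frac{c_2}{t}} \ge 1 - \frac{1}{2 c_1 (t+1) n^{c_2/(t+1)}}. \qquad (10)$$

The LHS of (10) achieves its only minimum at the point where its derivative is zero, namely when

$$\frac{t}{t+1} n^{\frac{c_2}{t(t+1)}} \left(1 + \frac{c_2}{t}\right) \lambda^{\frac{c_2}{t}} - \left(1 + \frac{c_2}{t}\right) (1-\lambda)^{\frac{c_2}{t}} = 0,$$

or

$$\left(\frac{t}{t+1}\right)^{t/c_2} n^{1/(t+1)} \lambda = 1 - \lambda,$$

that is,

$$\lambda = \frac{1}{\left(\frac{t}{t+1}\right)^{t/c_2} n^{1/(t+1)} + 1}. \qquad (11)$$

Plugging (11) into the LHS of (10) while letting $\gamma = \left(\frac{t}{t+1}\right)^{t/c_2} n^{1/(t+1)}$, we get

$$\frac{\gamma^{c_2/t}}{(\gamma+1)^{1+c_2/t}} + \frac{\gamma^{1+c_2/t}}{(\gamma+1)^{1+c_2/t}} = \frac{\gamma^{c_2/t}}{(\gamma+1)^{c_2/t}} = \left(\frac{\gamma}{\gamma+1}\right)^{c_2/t}.$$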

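Considering the RHS of (10), since $n^{c_2/(t+1)} = \gamma^{c_2} \left(\frac{t+1}{t}\right)^{t} \le e \gamma^{c_2}$, we have

$$1 - \frac{1}{2 c_1 (t+1) n^{c_2/(t+1)}} = 1 - \frac{1}{2 c_1 (t+1) \gamma^{c_2} \left(\frac{t+1}{t}\right)^{t}} \le 1 - \frac{1}{2 c_1 e (t+1) \gamma^{c_2}} \le 1 - \frac{1}{4 c_1 e t \gamma^{c_2}}.$$

Thus, to have (10), we just need to have

$$\left(\frac{\gamma}{\gamma+1}\right)^{c_2/t} \ge 1 - \frac{1}{4 c_1 e t \gamma^{c_2}},$$

or

$$\begin{aligned}
\frac{\gamma}{\gamma+1} &\ge \left(1 - \frac{1}{4 c_1 e t \gamma^{c_2}}\right)^{t/c_2} \\
\Longleftarrow\quad \frac{\gamma}{\gamma+1} &\ge \exp\left(-\frac{1}{4 c_1 c_2 e \gamma^{c_2}}\right) \\
\iff\quad 1 + \frac{1}{\gamma} &\le \exp\left(\frac{1}{4 c_1 c_2 e \gamma^{c_2}}\right) \\
\Longleftarrow\quad 1 + \frac{1}{\gamma} &\le 1 + \frac{1}{4 c_1 c_2 e \gamma^{c_2}},
\end{aligned}$$

where the last inequality holds if $\gamma \ge 4 c_1 c_2 e \gamma^{c_2}$, or $\gamma \ge (4 c_1 c_2 e)^{1/(1-c_2)}$. Finally, since

$$\gamma = \frac{n^{1/(t+1)}}{(1 + 1/t)^{t/c_2}} \ge \frac{n^{1/(t+1)}}{e^{1/c_2}} \ge \frac{n^{1/t_0}}{e^{1/c_2}} \ge e^{1/c_0 - 1/c_2},$$

as long as we choose $c_0$ small enough depending on $c_1$ and $c_2$, such that $e^{1/c_0 - 1/c_2} \ge (4 c_1 c_2 e)^{1/(1-c_2)}$, (10) will
hold, and hence $g_{k+1}(n) \ge c_1 (t+1) n^{1+c_2/(t+1)} - 2(t+1)n$. This also completes the induction on $t$ for phase one. Finally,
to ensure $c_1 t n^{1+c_2/t} - 2tn = \Omega(t n^{1+\Omega(1/t)})$ for $t \le t_0$, it suffices to have $c_1 n^{c_2/t_0} \ge c_1 e^{c_2/c_0} > 2$, which again can
be guaranteed by choosing $c_0$ small enough.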
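The minimizer (11) and the resulting closed form for the LHS of (10) can be sanity-checked numerically. The sketch below is an illustration rather than part of the proof: the function names are hypothetical, the sample values $n = 10^6$, $t = 3$, $c_1 = c_2 = 1/4$ are assumptions chosen for the demonstration, and the grid search only approximates the true minimum over $\lambda \in (0, 1)$.

```python
def gamma_of(n, t, c2):
    """gamma = (t/(t+1))^(t/c2) * n^(1/(t+1)), as defined after (11)."""
    return (t / (t + 1)) ** (t / c2) * n ** (1 / (t + 1))


def lhs_10(lam, n, t, c2):
    """Left-hand side of (10) as a function of lambda."""
    a = c2 / t
    coeff = (t / (t + 1)) * n ** (c2 / (t * (t + 1)))
    return coeff * lam ** (1 + a) + (1 - lam) ** (1 + a)


def check_inequality_10(n, t, c1=0.25, c2=0.25, grid=1000):
    """Return (lambda*, closed-form minimum, grid minimum, RHS of (10))."""
    g = gamma_of(n, t, c2)
    lam_star = 1 / (g + 1)                    # the minimizer from (11)
    closed_form = (g / (g + 1)) ** (c2 / t)   # claimed minimum value
    grid_min = min(lhs_10(i / grid, n, t, c2) for i in range(1, grid))
    rhs = 1 - 1 / (2 * c1 * (t + 1) * n ** (c2 / (t + 1)))
    return lam_star, closed_form, grid_min, rhs


if __name__ == "__main__":
    print(check_inequality_10(10 ** 6, 3))
```

For these sample values the closed-form minimum should agree with `lhs_10` evaluated at `lam_star`, no grid point should fall below it, and it should exceed the RHS of (10).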

The induction, phase two. The derivation for phase two is similar to that of phase one, and is given in the appendix. 
Combining the results of phase one and phase two we have proved part (ii) of Theorem 2. 

Tightness of the bounds. Ignoring the constants in the Big-Omega, the lower bound of Theorem 2 is tight for nearly
all values of $t$. Now we give some concrete strategies matching the lower bounds. For $t \ge 2 \log n$, we use the following
shuffling strategy. Let $x = t / \log n \ge 2$. Divide the $t$ bins evenly into $\log_x n$ groups of $t / \log_x n$ each. We use the first
group to accommodate the first $t / \log_x n$ balls. Then we shuffle these balls to one bin in the second group. In general,
when all the bins in group $i$ are occupied, we shuffle all the balls in group $i$ to one bin in group $i + 1$. The total cost
of this algorithm is obviously $n \log_x n$ since each ball has been to $\log_x n$ bins, one from each group. To show that this
algorithm actually works, we need to show that all the $n$ balls can indeed be accommodated. Since the capacity of
each group increases by a factor of $t / \log_x n$, the capacity of the last group is

$$\left(\frac{t}{\log_x n}\right)^{\log_x n} = \left(\frac{t \log x}{\log n}\right)^{\log_x n} = (x \log x)^{\log_x n} = n (\log x)^{\log_x n} \ge n.$$

Thus, part (i) of Theorem 2 is tight as long as $\log(t / \log n) = \Omega(\log t)$, or $t = \Omega(\log^{1+\epsilon} n)$.
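To make the group strategy concrete, the following sketch computes its parameters for inputs that are exact powers of two, so that all quantities are integers. The function name `group_strategy` and its parameterization by $\log n$ and $\log x$ (logarithms base 2) are assumptions made for this illustration.

```python
def group_strategy(log_n, log_x):
    """Parameters of the group strategy for n = 2^log_n and x = 2^log_x.

    Returns (t, number of groups, bins per group, capacity of the last
    group, total cost). Assumes log_x divides log_n so that log_x(n) is
    an integer.
    """
    assert log_n % log_x == 0
    n = 2 ** log_n
    x = 2 ** log_x
    t = x * log_n                      # from x = t / log n
    groups = log_n // log_x            # log_x n
    bins_per_group = t // groups       # t / log_x n = x * log x
    capacity = bins_per_group ** groups
    cost = n * groups                  # each ball visits one bin per group
    return t, groups, bins_per_group, capacity, cost


if __name__ == "__main__":
    print(group_strategy(16, 2))
```

With $n = 2^{16}$ and $x = 4$, there are $t = 64$ bins in 8 groups of 8 bins each, and the last group's capacity comfortably exceeds $n$.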

Part (ii) of the theorem concerns $t = O(\log n)$. For such a small $t$ we need to deploy a different strategy.
We always put balls one by one to the first bin $b_1$. When $b_1$ has collected $n^{1/t}$ balls, we shuffle all the balls to $b_2$.
Afterward, every time $b_1$ reaches $n^{1/t}$, we merge all the balls in $b_1$ and $b_2$ and put the balls back to $b_2$. For $b_2$, every
time it has collected $n^{2/t}$ balls from $b_1$, we merge all the balls with $b_3$. In general, every time $b_i$ has collected $n^{i/t}$
balls, we move all the balls to $b_{i+1}$. Let us compute the total cost of this strategy. For each shuffle, we charge its cost
to the destination bin. Thus, the cost charged to $b_1$ is at most $(n^{1/t})^2 \cdot n^{1-1/t} = n^{1+1/t}$, since for every group of $n^{1/t}$
balls, it pays a cost of at most $(n^{1/t})^2$ to add them one by one, and there are $n^{1-1/t}$ such groups. In general, for any
bin $b_i$, $1 \le i \le t$, the balls arrive in batches of $n^{(i-1)/t}$, and the bin clears itself after every $n^{1/t}$ such batches. The cost for
each batch is at most $n^{i/t}$, the maximum size of $b_i$, so the cost for all the $n^{1/t}$ batches before $b_i$ clears itself is $n^{(i+1)/t}$.
The bin clears itself $n / n^{i/t} = n^{1-i/t}$ times, so the total cost charged to $b_i$ is $n^{1+1/t}$. Therefore, the total cost charged
to all the bins is $t n^{1+1/t}$.
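The small-$t$ strategy behaves like a base-$n^{1/t}$ counter with carries, which makes it easy to simulate. The sketch below is a minimal illustration under stated assumptions: the function name `simulate_small_t` is hypothetical, $n$ is taken to be $m^t$ so that $m = n^{1/t}$ is an integer, a cascade of merges triggered by one arriving ball is treated as a single shuffle, and the cost of a shuffle is taken to be the number of balls in the bins it touches (consistent with $f_1(n) = \frac{1}{2} n(n+1)$ for a single bin).

```python
def simulate_small_t(m, t):
    """Simulate the small-t strategy with n = m**t balls and t bins.

    Bin j (0-indexed) forwards its contents to bin j+1 once it would hold
    m**(j+1) balls; the last bin only accumulates. Returns (n, total cost).
    """
    n = m ** t
    bins = [0] * t
    total_cost = 0
    for _ in range(n):
        j = 0
        carried = 1                      # the newly arriving ball
        # Cascade upward while the current bin would reach its threshold.
        while j < t - 1 and bins[j] + carried >= m ** (j + 1):
            carried += bins[j]
            bins[j] = 0
            j += 1
        bins[j] += carried
        total_cost += bins[j]            # all touched balls end up in bin j
    return n, total_cost


if __name__ == "__main__":
    for m, t in ((2, 2), (3, 3), (4, 3)):
        n, cost = simulate_small_t(m, t)
        print(m, t, n, cost, t * n * m)  # last column is t * n^(1 + 1/t)
```

The simulated cost can be compared with the upper bound $t n^{1+1/t} = t \cdot n \cdot m$ derived above.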

Combining part (i) and part (ii), our lower bound is thus tight for all $t$ except in the narrow range $\omega(\log n) \le t \le
o(\log^{1+\epsilon} n)$. And in this range, the gap between the upper and lower bounds is merely $\Theta\left(\frac{\log t}{\log(t/\log n)}\right) = o(\log\log n)$.
A The induction, phase two

In phase two, we will prove that $f_t(n) \ge c_1 t_0 n^{1+c_2/(t_0 + c_0(t-t_0)/\alpha)} - 2tn$ for $t_0 \le t \le \alpha \ln n$, where $\alpha$ is any
given constant. To simplify notation we define $h(t) = t_0 + c_0(t - t_0)/\alpha$. The base case $t = t_0$ for phase two
has already been established in phase one. Next we assume $f_t(n) \ge c_1 t_0 n^{1+c_2/h(t)} - 2tn$ and will prove that
$f_{t+1}(n) \ge c_1 t_0 n^{1+c_2/h(t+1)} - 2(t+1)n$.

From the recurrence (5) and the induction hypothesis, we have

$$f_{t+1}(n) \ge \min_{k,\ x_1+\cdots+x_k=n} \left\{ c_1 t_0 x_1^{1+c_2/h(t)} - 2t x_1 + \cdots + c_1 t_0 x_k^{1+c_2/h(t)} - 2t x_k + \tfrac{1}{2}(k x_1 + \cdots + x_k) - 2n \right\}. \qquad (12)$$

Similarly as in phase one, let $g_k(n)$ be the minimum of (12) for a given $k$. Here we need to show that

$$g_k(n) \ge c_1 t_0 n^{1+c_2/h(t+1)} - 2(t+1)n. \qquad (13)$$

Again we use induction on $k$ to prove (13). The base case is easily seen as $g_1(n) = c_1 t_0 n^{1+c_2/h(t)} - 2tn + \frac{1}{2} n -
2n \ge c_1 t_0 n^{1+c_2/h(t+1)} - 2(t+1)n$. Now suppose (13) holds for $k$; we will show $g_{k+1}(n) \ge c_1 t_0 n^{1+c_2/h(t+1)} -
2(t+1)n$. By the induction hypothesis, we have

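$$\begin{aligned}
g_{k+1}(n) &= \min_{x_1+\cdots+x_{k+1}=n} \left\{ c_1 t_0 x_1^{1+c_2/h(t)} - 2t x_1 + \cdots + c_1 t_0 x_{k+1}^{1+c_2/h(t)} - 2t x_{k+1} + \tfrac{1}{2}\left((k+1) x_1 + \cdots + x_{k+1}\right) - 2n \right\} \\
&= \min_{x_{k+1}} \Big\{ c_1 t_0 x_{k+1}^{1+c_2/h(t)} - 2t x_{k+1} + \tfrac{1}{2} n + \min_{x_1+\cdots+x_k=n-x_{k+1}} \big\{ c_1 t_0 x_1^{1+c_2/h(t)} - 2t x_1 + \cdots \\
&\qquad\qquad + c_1 t_0 x_k^{1+c_2/h(t)} - 2t x_k + \tfrac{1}{2}(k x_1 + \cdots + x_k) - 2(n - x_{k+1}) \big\} - 2 x_{k+1} \Big\} \\
&= \min_{x_{k+1}} \left\{ c_1 t_0 x_{k+1}^{1+c_2/h(t)} - 2(t+1) x_{k+1} + \tfrac{1}{2} n + g_k(n - x_{k+1}) \right\} \\
&\ge \min_{x_{k+1}} \left\{ c_1 t_0 x_{k+1}^{1+c_2/h(t)} - 2(t+1) x_{k+1} + \tfrac{1}{2} n + c_1 t_0 (n - x_{k+1})^{1+c_2/h(t+1)} - 2(t+1)(n - x_{k+1}) \right\} \\
&= \min_{x_{k+1}} \left\{ c_1 t_0 x_{k+1}^{1+c_2/h(t)} + \tfrac{1}{2} n + c_1 t_0 (n - x_{k+1})^{1+c_2/h(t+1)} - 2(t+1) n \right\}.
\end{aligned}$$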

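Setting $x_{k+1} = \lambda n$ where $0 < \lambda < 1$, we will show

$$c_1 t_0 (\lambda n)^{1+c_2/h(t)} + c_1 t_0 ((1-\lambda) n)^{1+c_2/h(t+1)} + \tfrac{1}{2} n \ge c_1 t_0 n^{1+c_2/h(t+1)} \qquad (14)$$

for all $\lambda$. (14) is equivalent to

$$n^{\frac{c_2 c_0/\alpha}{h(t) h(t+1)}} \lambda^{1+\frac{c_2}{h(t)}} + (1-\lambda)^{1+\frac{c_2}{h(t+1)}} + \frac{1}{2 c_1 t_0 n^{c_2/h(t+1)}} \ge 1. \qquad (15)$$

Since $(1-\lambda)^{1+\frac{c_2}{h(t+1)}} \ge (1-\lambda)^{1+\frac{c_2}{h(t)}}$, to prove (15), it suffices to prove

$$n^{\frac{c_2 c_0/\alpha}{h(t) h(t+1)}} \lambda^{1+\frac{c_2}{h(t)}} + (1-\lambda)^{1+\frac{c_2}{h(t)}} \ge 1 - \frac{1}{2 c_1 t_0 n^{c_2/h(t+1)}}. \qquad (16)$$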



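The LHS of (16) achieves its only minimum when

$$n^{\frac{c_2 c_0/\alpha}{h(t) h(t+1)}} \left(1 + \frac{c_2}{h(t)}\right) \lambda^{\frac{c_2}{h(t)}} - \left(1 + \frac{c_2}{h(t)}\right) (1-\lambda)^{\frac{c_2}{h(t)}} = 0,$$

or $n^{\frac{c_0/\alpha}{h(t+1)}} \lambda = 1 - \lambda$, that is,

$$\lambda = \frac{1}{n^{\frac{c_0/\alpha}{h(t+1)}} + 1}. \qquad (17)$$

Plugging (17) into (16) while letting $\gamma = n^{\frac{c_0/\alpha}{h(t+1)}}$, (16) becomes

$$\left(\frac{\gamma}{\gamma+1}\right)^{c_2/h(t)} \ge 1 - \frac{1}{2 c_1 t_0 \gamma^{c_2\alpha/c_0}},$$

or

$$\begin{aligned}
\frac{\gamma}{\gamma+1} &\ge \left(1 - \frac{1}{2 c_1 t_0 \gamma^{c_2\alpha/c_0}}\right)^{h(t)/c_2} \\
\Longleftarrow\quad \frac{\gamma}{\gamma+1} &\ge \exp\left(-\frac{h(t)}{2 c_1 c_2 t_0 \gamma^{c_2\alpha/c_0}}\right) \\
\Longleftarrow\quad \frac{\gamma}{\gamma+1} &\ge \exp\left(-\frac{1}{2 c_1 c_2 \gamma^{c_2\alpha/c_0}}\right) \\
\iff\quad 1 + \frac{1}{\gamma} &\le \exp\left(\frac{1}{2 c_1 c_2 \gamma^{c_2\alpha/c_0}}\right) \\
\Longleftarrow\quad 1 + \frac{1}{\gamma} &\le 1 + \frac{1}{2 c_1 c_2 \gamma^{c_2\alpha/c_0}},
\end{aligned}$$

where the second step uses $h(t) \ge t_0$, and the last inequality holds if $\gamma \ge 2 c_1 c_2 \gamma^{c_2\alpha/c_0}$. We will choose $c_2, c_0$ such that
$c_2\alpha/c_0 > 1$, thus this becomes $\gamma \le \left(\frac{1}{2 c_1 c_2}\right)^{\frac{1}{c_2\alpha/c_0 - 1}}$. Since $\gamma = n^{\frac{c_0/\alpha}{h(t+1)}} \le n^{\frac{c_0/\alpha}{c_0 \ln n}} = e^{1/\alpha}$, we just need to have
$e^{1/\alpha} \le \left(\frac{1}{2 c_1 c_2}\right)^{\frac{1}{c_2\alpha/c_0 - 1}}$ to make sure that (16) holds. This would also complete the induction on $t$ for phase two.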

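We also need to ensure that $c_1 t_0 n^{1+c_2/h(t)} - 2tn \ge c_1 c_0/\alpha \cdot t n^{1+c_2/h(t)} - 2tn = \Omega(t n^{1+\Omega(1/t)})$ for phase two. This
just requires $c_1 c_0/\alpha \cdot n^{c_2/h(t)} > 2$. Since $c_1 c_0/\alpha \cdot n^{c_2/h(t)} \ge c_1 c_0/\alpha \cdot n^{\frac{c_2}{(2c_0 - c_0^2/\alpha) \ln n}} = c_1 c_0/\alpha \cdot e^{\frac{c_2}{2c_0 - c_0^2/\alpha}}$, we just
require $c_1 c_0/\alpha \cdot e^{\frac{c_2}{2c_0 - c_0^2/\alpha}} > 2$.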

Finally, we put together all the constraints that we have on the constants:

$$\begin{cases}
c_1 \le 1/2,\ c_2 \le 1, \\
c_1 \le 1/4,\ c_2 \le 2/3, \\
c_0 < c_2, \\
(4 c_1 c_2 e)^{1/(1-c_2)} \le e^{1/c_0 - 1/c_2}, \\
2 < c_1 e^{c_2/c_0}, \\
e^{1/\alpha} \le \left(\frac{1}{2 c_1 c_2}\right)^{\frac{1}{c_2\alpha/c_0 - 1}}, \\
2 < c_1 c_0/\alpha \cdot e^{\frac{c_2}{2c_0 - c_0^2/\alpha}}.
\end{cases}$$

We can first fix $c_1 = c_2 = 1/4$. This makes $(4 c_1 c_2 e)^{1/(1-c_2)} < 1$. Then we choose $c_0$ small enough such that the
third and the fifth constraints are satisfied. That $c_0 < c_2$ also makes $e^{1/c_0 - 1/c_2} > 1$, satisfying the fourth constraint.
Finally, we will make $c_0$ even smaller if necessary (depending on $\alpha$), to satisfy the last two constraints. This completes
the proof of part (ii) of Theorem 2.